PATENT

METHOD AND APPARATUS FOR COMPRESSING VLIW
INSTRUCTION AND SHARING SUBINSTRUCTIONS

BACKGROUND OF THE INVENTION 

This invention relates to very long instruction word (VLIW) computing 
architectures, and more particularly to methods and apparatus for reducing storage 
requirements of VLIW instructions. 

Multimedia computing applications such as image processing are more 
efficiently implemented using parallel structures for handling multiple streams of data. 
VLIW processors, such as the TMS320C6x manufactured by Texas Instruments of
Dallas, Texas and the MAP 1000 manufactured by Hitachi Ltd. of Tokyo, Japan and
Equator Technologies of Campbell, California, support a large degree of parallelism of
both data streams and instruction streams to implement parallel or pipelined execution of 
program instructions. 

A VLIW processor includes one or more of multiple homogeneous processing 
blocks referred to as clusters. Each cluster includes a common number of functional 
processing units. A VLIW instruction includes multiple subinstruction fields. The size 
of the VLIW instruction grows linearly with the number of parallel operations being 
defined concurrently in the subinstruction fields. The subinstructions present in an 
instruction are distributed among functional processing units for parallel execution. 

Conventional VLIW processors typically execute fewer than ten operations per 
instruction. The number of concurrent executions is likely to increase substantially in 
future media processors with instructions likely to be 256 or 512 bits wide. As the size 
of the instruction increases, however, a correspondingly increased burden on the data 
flow and memory structures occurs. To provide enough instruction fetch bandwidth, 
the VLIW instructions typically are fetched first from external memory and stored in an 
on-chip instruction cache before being executed. Thrashing of the cache (i.e., cyclic
misses), for example, during a tight processing loop is very undesirable, resulting in
degraded performance. Accordingly, it is increasingly desirable to manage the
instruction cache effectively to sustain a desired high processing throughput.

At the same time, the need for a larger instruction cache increases as the clock
frequency of the processor increases, as wider VLIW architectures are adopted, and as
more complex algorithms are developed. Accordingly, there is a need for methods of
efficiently handling and caching VLIW instructions.

SUMMARY OF THE INVENTION 

According to the invention, subinstructions of a VLIW instruction are shared
among functional processing units to reduce the size of the VLIW instruction as stored
in instruction cache and, in some embodiments, main memory. Specifically, the VLIW
instruction is compressed in cases of subinstruction sharing. In some embodiments the
instruction is compressed at compilation time and stored in main memory in compressed
format. In other embodiments the instruction is stored in main memory in
uncompressed format and compressed before being stored in cache memory.

According to one aspect of the invention, a set of instruction-compression
control bits is associated with each VLIW instruction. In one embodiment the VLIW
instruction is formatted to include the set of control bits within the instruction. A VLIW
instruction includes a plurality of subinstruction fields, the set of instruction-compression
control bits, and other miscellaneous control bits, such as those describing
the locations of NOP instructions (i.e., empty fields).

In a fully expanded format, a VLIW instruction includes a prescribed number
of subinstruction fields, where the number of fields is determined by the architecture of
the processor executing the VLIW instruction. Some of the subinstruction fields may
be NOP instructions. Further, some subinstruction fields may include the same
subinstruction as in other subinstruction fields. It is known to compress an instruction
to remove the space allocated for the NOP instructions. According to this invention a
scheme is provided to reduce the redundancies of subinstructions in select cases.

Consider an architecture in which an instruction includes four subinstruction
fields. There are 15 situations of interest for such an instruction. In one situation there
are no redundant subinstructions, (e.g., ABCD). In the remaining situations there is
some degree of redundancy among subinstructions, (e.g., AAAA, AAAB, AABA,
ABAA, BAAA, AABB, ABAB, ABBA, AABC, ABAC, ABCA, BAAC, BACA,
BCAA). Note that A, B, C, and D are used in the sense of identifying whether a
subinstruction is the same or different from another subinstruction in the field. One
skilled in the art will appreciate that there are many different subinstructions A.
Similarly, there are many different subinstructions B, C and D.
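The count of 15 situations for four subinstruction fields can be checked with a short sketch. The helper names below (canonical, sharing_patterns) are hypothetical and are not part of the described method. Each arrangement is relabeled by order of first appearance, so that, for example, the situation written above as BAAA appears here as ABBB.

```python
from itertools import product
from string import ascii_uppercase


def canonical(pattern):
    """Relabel items by order of first appearance, e.g. BAAC -> ABBC."""
    labels = {}
    out = []
    for item in pattern:
        if item not in labels:
            labels[item] = ascii_uppercase[len(labels)]
        out.append(labels[item])
    return "".join(out)


def sharing_patterns(fields=4):
    """Return every distinct same/different arrangement of the fields."""
    return sorted({canonical(p) for p in product(range(fields), repeat=fields)})


patterns = sharing_patterns(4)
print(len(patterns))   # 15 arrangements, including ABCD (no redundancy)
print(patterns)
```

Only one of the 15 arrangements (ABCD) contains no redundancy; the other 14 are candidates for subinstruction sharing.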

For an architecture in which there are more subinstruction fields, there are
additional situations of redundant subinstructions. For any given architecture, there are
not more than 2^z possible situations, where 'z' is the number of subinstruction fields.
To cover every redundancy situation, there would be as many as 'z' control bits, where
z is the maximum number of subinstruction fields allowed in the processor architecture.
In some embodiments all such situations are covered by including 'z' control bits with
each instruction. However, as the instruction width increases, it may be undesirable to
add so many extra control bits for subinstruction sharing. In particular, the cost of so
many bits may seem excessive when there tends to be a pattern among the
subinstruction redundancies that comes up over and over again in practice. As a result,
in a preferred embodiment the number of control bits is reduced to less than 'z' to
handle a prescribed number of the approximately 2^z subinstruction sharing situations
possible.

Different processors can be designed for different applications where the
pattern of subinstruction redundancies varies. Further, the cases of subinstruction
redundancies covered for subinstruction sharing may be strategically selected for a
given processor to have greatest impact on those applications for which the processor is
targeted, (e.g., for image processing applications).

Although any subinstruction redundancy situation is potentially covered and
designed into the processor architecture, in one strategy all or less than all
subinstruction sharing possibilities are covered. In one embodiment subinstruction
sharing is provided for redundant subinstructions destined for corresponding functional
processing units. A functional processing unit is a part of a processor. A processor
includes 'z' functional processing units, where 'z' is the maximum number of
subinstructions in an instruction. More specifically, however, a processor includes a
plurality of clusters, in which each cluster includes a common number of functional
processing units (FPU's). For each FPU in one cluster there is a corresponding FPU
in each other cluster. Each corresponding FPU has the same functionalities. For
example where there are three clusters of four FPU's, there are four sets of
corresponding FPU's. In one strategy each permutation of subinstruction redundancy
among any two or more corresponding FPU's is covered. For such example z=7, and
there are 7 instruction-compression control bits per instruction. This is less than the
maximum number of control bits (e.g., 12) to cover every possible subinstruction
sharing situation among the 12 functional processing units.

According to another aspect of the invention, redundant subinstructions
destined for a corresponding functional processing unit in each cluster are compressed.
Specifically, when the same subinstruction is present in an instruction for the
corresponding functional processing unit in at least two clusters, then according to this
invention, only one copy of the subinstruction need be stored. The redundant
subinstruction for the corresponding functional processing unit is omitted, resulting in a
compressed instruction. For such a compression there is a corresponding condition of
the instruction-compression control bits which identifies the redundant subinstruction
fields sharing a specific subinstruction.

According to another aspect of this invention, during compilation of a
computer program for the VLIW processor, (e.g., compilation of higher language
source code or assembly of assembler source code), the instruction-compression
control bits are set to specify the conditions which define each compression for a given
instruction. The instruction, including the instruction-compression control bits, is
stored in memory in either compressed or uncompressed format. When stored in
uncompressed format, the instruction is compressed before being stored in the
processor's on-chip instruction cache. Accordingly, the instruction is compressed at
any step between the main memory storage of the instruction and the on-chip instruction
cache storage of the instruction, (e.g., it is compressed and re-stored in main memory; it
is compressed and stored in a primary cache or secondary cache; it is compressed when
moved to the on-chip instruction cache).

According to another aspect of this invention, the condition of the
instruction-compression control bits determines how the compressed instruction is to be
decompressed for execution. In particular, the control bits determine how one or more
subinstructions in the compressed instruction are to be shared among functional
processing units of the VLIW processor for concurrent execution. The set of
instruction-compression control bits identifies one or more compression conditions in
which redundant corresponding subinstructions are stored once, rather than
redundantly. Each identified condition corresponds to a subinstruction to be shared by
at least two corresponding functional processing units.

Docket No.: OT2.P59

The advantage of associating functional processing units of differing clusters
as being corresponding functional processing units, and compressing the redundant
subinstructions destined for such corresponding functional processing units, is due to
the regular program structure of image computing algorithms. Applicants have
observed many tight loops in their program code for image computing library functions
in which the same subinstructions are used in multiple clusters. For example, in a 2D
convolution function implemented on a VLIW processor having two clusters, the most
frequently-used subinstructions are inner product and partitioned compaction of the
inner product results. It has been observed that for most instructions that perform either
of these subinstructions, multiple clusters are assigned the same subinstruction.
Specifically, in assembly code written for the MAP 1000 processor for such function,
applicants observed that the tight loop program has 67 instructions out of 133
instructions that consist of exactly the same subinstructions (including operands) for
both clusters. Accordingly, the redundancy of subinstructions destined for
corresponding functional processing units is a significant occurrence in important image
processing algorithms. By avoiding the redundancy in VLIW instructions so that fewer
instruction bits are needed if the same subinstruction is to be executed on multiple
clusters, program size is reduced. In addition, efficiency of the instruction cache usage
is improved, and the effective instruction fetch bandwidth is increased.

These and other aspects and advantages of the invention will be better 
understood by reference to the following detailed description taken in conjunction with 
the accompanying drawings. 


BRIEF DESCRIPTION OF THE DRAWINGS 

Fig. 1 is a block diagram of development and storage of a computer program 
having VLIW instructions; 

Fig. 2 is a partial block diagram of a computer system having a VLIW
processor;

Fig. 3 is a block diagram of a VLIW processor architecture;

Fig. 4 is a diagram of a VLIW instruction format, identifying destinations for
various subinstruction field contents for the processor of Fig. 3;

Fig. 5 is a diagram of an exemplary uncompressed VLIW instruction;

Fig. 6 is a diagram of a VLIW instruction compressed to remove NOP subinstructions;

Fig. 7 is a diagram of a VLIW instruction compressed according to an
embodiment of this invention to implement subinstruction sharing;

Fig. 8 is a diagram of the set of control bits included in the instruction of Fig.

Fig. 9 is a diagram of a multiplexing architecture for decoding the control bits
of an instruction to determine various subinstruction sharing conditions;

Figs. 10A-10E are diagrams of exemplary instructions showing an intended
distribution of the instruction, the instruction with NOP compression, and the
instruction in a format for subinstruction sharing;

Fig. 11 is a flow chart of a method for setting the instruction-compression
control bits;

Fig. 12 is a flow chart of a method for compressing an instruction for 
subinstruction sharing; and 

Fig. 13 is a flow chart of a method for decoding the instruction-compression 
control bits to identify various subinstruction sharing conditions. 

DESCRIPTION OF SPECIFIC EMBODIMENTS 

Overview 

Fig. 1 shows a block diagram of program compilation and storage for a very
long instruction word (VLIW) processor. The term "very long instruction word"
(VLIW) is a term of art in the fields of computer system and processor architecture,
parallel processing and image processing, generally referring to architectures in which a
processor is able to handle an instruction which typically is 64 bits or longer, and
consists of multiple subinstructions.

A program engineer prepares, tests and debugs source code 12. The source 
code 12 is written in assembler language or a higher order programming language. The 
source code then is compiled/assembled by compiler/assembler 14 resulting in machine 
code 16. The machine code 16 is stored in memory 18 of a computer having a 
processor which is to execute the machine code 16. 

Referring to Fig. 2, a host computer 10 includes a very long instruction word 
(VLIW) processor 20, an instruction cache 22, and main memory 24. In a preferred 
embodiment the instruction cache 22 is part of the processor 20 (being located on-chip). 
The main memory is the memory 18 or receives the computer program machine code 16 
from the memory 18. Referring to Fig. 3, a typical VLIW processor 20 architecture
includes a plurality of clusters 26 of functional processing units (FPU's) 28. Each
cluster 26 includes a common number of functional processing units 28. As a result,
there is a one to one correspondence of functional processing units 28 in the differing
clusters 26. Fig. 3 shows a generic architecture having 'n' clusters of 'm' functional
processing units per cluster. A first cluster has functional processing units (1,1) to
(1,m). A second cluster has functional processing units (2,1) to (2,m). The n-th
cluster has functional processing units (n,1) to (n,m). Accordingly there are n*m
functional processing units. For each cluster 26, there is a dedicated register file 27.
The values of n, m, and n*m are prescribed numbers determined by the processor 20 
architecture. Such values may vary for differing embodiments. The value of n*m is a 
prescribed number corresponding to the maximum number of subinstructions which 
may be included in a VLIW instruction for a processor having n*m functional 
processing units. 

Referring to Fig. 4, an instruction format 30 for processor 20 includes up to 
n*m subinstruction fields 32. The content of each subinstruction field 32 is routed to a 
corresponding functional processing unit 28 for processing. For an instruction in 
which all n*m subinstruction fields are filled, a subinstruction is routed to each one of 
the n*m functional processing units 28. Typically there is only one program counter 
for all n clusters 26. As a result, the functional processing units typically operate 
synchronously to concurrently execute the subinstructions of a given instruction. 
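The routing of subinstruction fields to functional processing units can be pictured with a minimal sketch. It assumes, for illustration only, that the n*m fields are ordered cluster by cluster (row-major); the function names field_to_fpu and corresponding_fpus are hypothetical.

```python
def field_to_fpu(k, m):
    """Map a 0-based field index k to a 1-based (cluster, unit) pair.

    Assumes fields are laid out cluster by cluster, m units per cluster.
    """
    return (k // m + 1, k % m + 1)


def corresponding_fpus(n, m):
    """Group FPUs by unit position: unit i of every cluster forms one set."""
    return {i: [(c, i) for c in range(1, n + 1)] for i in range(1, m + 1)}


# Three clusters of four FPU's give four sets of corresponding FPU's.
sets = corresponding_fpus(3, 4)
print(len(sets))   # 4
print(sets[2])     # [(1, 2), (2, 2), (3, 2)]
```

Under this layout, field index 0 goes to FPU (1,1) and field index m goes to FPU (2,1), which are corresponding units.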

When a subinstruction field 32i is empty, the instruction 30 is compressed 
using conventional techniques. As a result the memory space occupied by the 
instruction is reduced. This invention relates to additional techniques which compress 
the instruction size when there are redundant subinstructions for corresponding 
functional processing units 28 of multiple clusters 26. In particular, in a tight loop of 
an image computing algorithm it has been observed that the same subinstruction is 
executed in multiple clusters. Conventionally, the subinstruction is repeated in each 
subinstruction field 32, resulting in an inefficient use of instruction cache 22 memory 
space and an inefficient application of memory transfer bandwidth. In a compressed 
instruction format according to an aspect of this invention, subinstructions are shared 
among multiple functional processing units 28. 

Compressed Instruction Format 

Referring to Fig. 5, an example of an uncompressed instruction 34 is shown
including 'n x m' subinstruction fields 32 which store respective subinstructions 36.
Some subinstruction fields 32 may be blank, (e.g., field 32(2,1)). Some subinstruction
fields may include the same subinstruction as another subinstruction field. Each
subinstruction field 32 is associated with a specific functional processing unit 28 of a
specific cluster 26. In the example illustrated, subinstruction fields (1,1) to (1,m) are
associated with the functional processing units (1,1) to (1,m) respectively of cluster 1.
Subinstruction fields (2,1) to (2,m) are associated with the functional processing units
(2,1) to (2,m) respectively of cluster 2. Each subinstruction field is similarly
associated up to subinstruction fields (n,1) to (n,m) being associated with functional
processing units (n,1) to (n,m) respectively of cluster n.

Note that functional processing units (_,1) for each of clusters 1 through n are
referred to herein as corresponding functional processing units. In particular, they are
referred to as the corresponding first functional processing unit of each of the multiple
clusters. When the same subinstruction is included in a given instruction for processing
by corresponding functional processing units (_,i), the instruction format is compressed
to eliminate the redundancy. Note that when the same subinstruction is included, but
for non-corresponding functional processing units (_,i) and (_,j), the redundancy is not
treated, (i.e., the instruction format is not necessarily compressed). In some
embodiments these redundancies also are treated, but in a preferred embodiment they
are ignored. Such redundancies are ignored because they do not result in gains in
efficiency comparable to the gains in treating redundancies among subinstructions
destined for corresponding functional processing units.

Referring to Fig. 6, the exemplary instruction 34 is shown in a conventional
compressed format 34' in which the blank fields are omitted. The location of where the
blank field would occur in the uncompressed format is shown with an asterisk ('*').

Referring to Fig. 7, the exemplary instruction 34 is shown in a compressed
format 34" according to an aspect of this invention. In the compressed format, there is
an area for a set 37 of instruction-compression control bits, along with one or more,
preferably non-empty, subinstruction fields. Subinstruction fields for corresponding
FPU's (_,i) which store the same subinstruction are reduced to include the
subinstruction for only one of the corresponding functional processing units. Such
corresponding functional processing units share the subinstruction.

With regard to the exemplary instruction depicted, note that the second
functional processing units of both the first cluster and the n-th cluster are destined
to receive a common subinstruction. These FPU's (1,2) and (n,2) are
corresponding functional processing units. Accordingly, the redundant subinstruction
is omitted. In various embodiments the redundant subinstruction is omitted at the first
occurrence, second occurrence or another occurrence. In the embodiment illustrated all
but the first occurrence are omitted. The location where the omitted redundant field
would occur in the uncompressed format is shown with a double asterisk ('**'). Also
note that the empty subinstruction field also is compressed. In various embodiments
the empty field may or may not be compressed, depending on whether the conventional
compression technique is also implemented.

Also note that subinstruction fields 32(1,m) and 32(2,2) each have a common
subinstruction 'C'. In some embodiments a compression operation is performed to
avoid this redundancy. However, because such redundancy has been found not to
occur very often, the redundancy is left 'as is' in a preferred embodiment. Similarly,
the subinstruction fields 32(n,1) and 32(n,3) also have a common subinstruction, 'E'.
These are destined for FPU's in a common cluster. In some embodiments a
compression operation is performed to avoid this redundancy. However, because such
redundancy has been found not to occur very often, the redundancy is left 'as is' in a
preferred embodiment.

The set 37 of instruction-compression control bits includes enough bits to
identify each possible condition in which corresponding functional processing units are
to share a subinstruction. For example, where there are two clusters of 'm' FPU's per
cluster, then the set 37 includes 'm' control bits. Where there are 'n' clusters of two
FPU's per cluster, then the set 37 includes 'n' control bits. In an architecture having
'n' clusters of 'm' FPU's per cluster, the set 37 in a best mode embodiment includes
n+m control bits, where n>2 and m>2. In other embodiments the number of control
bits may vary. Table 1 below shows the bit encoding for an architecture in which there
are 2 clusters of 2 functional processing units per cluster. For such architecture there
are 2 control bits in the set 37.

Table 1: Control Bit Encoding

00 No subinstruction sharing

01 First subinstruction in compressed format is shared by FPU's (_,1)

10 Second subinstruction in compressed format is shared by FPU's (_,2)

11 First subinstruction in compressed format is shared by FPU's (_,1), &
   Second subinstruction in compressed format is shared by FPU's (_,2)
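The encoding of Table 1 can be read as one flag per unit position: the least significant control bit marks sharing among FPU's (_,1) and the next bit marks sharing among FPU's (_,2). A minimal sketch of that reading follows, for two clusters of m FPU's per cluster; the function name shared_units is hypothetical and the bit-per-unit generalization to m bits is an assumption drawn from the statement that such an architecture uses 'm' control bits.

```python
def shared_units(control_bits, m=2):
    """Return the 1-based unit positions i whose FPU's (_,i) share a
    subinstruction, treating bit (i - 1) of control_bits as the flag."""
    if not 0 <= control_bits < (1 << m):
        raise ValueError("control_bits does not fit in m bits")
    return [i + 1 for i in range(m) if control_bits & (1 << i)]


for code in (0b00, 0b01, 0b10, 0b11):
    print(format(code, "02b"), shared_units(code))
```

For the four codes of Table 1 this lists no units, unit 1, unit 2, and units 1 and 2, respectively.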

Table 2 below shows the bit encoding for an architecture in which there are 2
clusters of 3 functional processing units per cluster. For such architecture there are 3
control bits in the set 37.

Table 2: Control Bit Encoding

000 No subinstruction sharing

001 First subinstruction in compressed format is shared by FPU's (_,1)

010 Second subinstruction in compressed format is shared by FPU's (_,2)

011 First subinstruction in compressed format is shared by FPU's (_,1), &
    Second subinstruction in compressed format is shared by FPU's (_,2)

100 Third subinstruction in compressed format is shared by FPU's (_,3)

101 First subinstruction in compressed format is shared by FPU's (_,1), &
    Third subinstruction in compressed format is shared by FPU's (_,3)

110 Second subinstruction in compressed format is shared by FPU's (_,2), &
    Third subinstruction in compressed format is shared by FPU's (_,3)

111 First subinstruction in compressed format is shared by FPU's (_,1),
    Second subinstruction in compressed format is shared by FPU's (_,2), &
    Third subinstruction in compressed format is shared by FPU's (_,3)

In various embodiments there are various encoding schemes that can be 
implemented to identify each potential compression condition in which a subinstruction 
is to be shared among two or more corresponding FPU's (_,i). 
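As an illustration of how such control bits might be produced, the following sketch compresses an instruction for two clusters of m FPU's per cluster. It is not the method of the flow charts; it assumes that every field is non-empty (no NOP handling), that the compressed format holds the first cluster's fields followed by the second cluster's non-shared fields, and that bit (i - 1) flags sharing among FPU's (_,i). The function name compress is hypothetical.

```python
def compress(cluster1, cluster2):
    """Return (control_bits, compressed_fields) for a two-cluster instruction.

    cluster1 and cluster2 list the subinstructions for units 1..m of each
    cluster. A subinstruction that is identical for corresponding units is
    stored once and flagged in control_bits.
    """
    if len(cluster1) != len(cluster2):
        raise ValueError("clusters must have the same number of units")
    control_bits = 0
    fields = list(cluster1)              # cluster 1 fields are always kept
    for i, (a, b) in enumerate(zip(cluster1, cluster2)):
        if a == b:
            control_bits |= 1 << i       # shared by FPU's (_, i + 1)
        else:
            fields.append(b)             # cluster 2 needs its own copy
    return control_bits, fields


bits, fields = compress(["A", "B", "C"], ["A", "D", "C"])
print(format(bits, "03b"), fields)   # 101 ['A', 'B', 'C', 'D']
```

Here six subinstruction fields shrink to four, and the control bits 101 record that the first and third units of both clusters share a subinstruction.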

Because it may be undesirable to add a significant number of control bits to
each instruction, a subset of compression conditions may be identified by a reduced
number of control bits. For example, in a four cluster architecture with two FPU's per 
cluster, four control bits could be used in a similar way as specified above, or three 
control bits could be used as described in Table 3 below. 

Table 3: Control Bit Encoding

000 No subinstruction sharing

001 First subinstruction in compressed format is shared by all FPU's (i,1), i=1,4

010 Second subinstruction in compressed format is shared by FPU's (i,2), i=1,4

011 First subinstruction in compressed format is shared by FPU's (i,1),
    Second subinstruction in compressed format is shared by FPU's (i,2), i=1,4

100 First subinstruction in compressed format is shared by FPU's (1,1), (3,1),
    Second subinstruction in compressed format is shared by FPU's (1,2), (3,2),
    Third subinstruction in compressed format is shared by FPU's (2,1), (4,1),
    Fourth subinstruction in compressed format is shared by FPU's (2,2), (4,2)

101 First subinstruction in compressed format is shared by FPU's (1,1), (2,1),
    Second subinstruction in compressed format is shared by FPU's (1,2), (2,2),
    Third subinstruction in compressed format is shared by FPU's (3,1), (4,1),
    Fourth subinstruction in compressed format is shared by FPU's (3,2), (4,2)

110 First subinstruction in compressed format is shared by FPU's (1,1), (2,1),
    Second subinstruction in compressed format is shared by FPU's (1,2), (2,2),
    Third through sixth not shared

111 First through Fourth not shared,
    Fifth subinstruction in compressed format is shared by FPU's (3,1), (4,1),
    Sixth subinstruction in compressed format is shared by FPU's (3,2), (4,2)

One skilled in the art will appreciate that different encoding schemes may be 
implemented to identify a variety of subinstruction sharing conditions. Different 
decoding architectures would accompany the different encoding schemes to implement a 
desired subinstruction sharing scheme. 
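One way to picture the reduced encoding of Table 3 is as a lookup table from the three control bits to groups of FPU's that share one stored subinstruction. The sketch below is illustrative only: the names TABLE_3 and stored_fields are hypothetical, the notation "i=1,4" is read as clusters 1 through 4 (an assumption), and all eight fields are assumed non-empty.

```python
# Each entry lists the groups of (cluster, unit) FPU's sharing one subinstruction.
TABLE_3 = {
    0b000: [],
    0b001: [[(1, 1), (2, 1), (3, 1), (4, 1)]],
    0b010: [[(1, 2), (2, 2), (3, 2), (4, 2)]],
    0b011: [[(1, 1), (2, 1), (3, 1), (4, 1)],
            [(1, 2), (2, 2), (3, 2), (4, 2)]],
    0b100: [[(1, 1), (3, 1)], [(1, 2), (3, 2)],
            [(2, 1), (4, 1)], [(2, 2), (4, 2)]],
    0b101: [[(1, 1), (2, 1)], [(1, 2), (2, 2)],
            [(3, 1), (4, 1)], [(3, 2), (4, 2)]],
    0b110: [[(1, 1), (2, 1)], [(1, 2), (2, 2)]],
    0b111: [[(3, 1), (4, 1)], [(3, 2), (4, 2)]],
}


def stored_fields(code, total_fields=8):
    """Number of subinstruction fields kept in the compressed format:
    every sharing group of size g saves g - 1 fields."""
    return total_fields - sum(len(group) - 1 for group in TABLE_3[code])


for code in sorted(TABLE_3):
    print(format(code, "03b"), stored_fields(code))
```

The field counts agree with the wording of the table: codes 110 and 111 leave six stored subinstructions, and codes 100 and 101 leave four.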

Subinstruction Sharing 

Referring to Fig. 9, an exemplary multiplexing scheme is depicted for
decoding the set of control bits and determining which subinstructions if any are to be
shared among corresponding FPU's of a VLIW processor. In one embodiment, the
processor 20 includes logic for performing such decoding and subinstruction sharing.
In the illustrated embodiment, there are two clusters 26 of two functional units 28 per
cluster. A VLIW instruction 42 is retrieved from instruction cache 22 and parsed based
upon the condition of the set 37 of control bits. The VLIW instruction 42 for such
embodiment includes two, three or four subinstruction fields 32.

A multiplexer 44 couples the first functional unit of the second cluster to the
first subinstruction field and the third subinstruction field of the instruction 42. A
multiplexer 46 couples the second functional unit of the second cluster to the second
subinstruction field and the fourth subinstruction field of the instruction 42. According
to the decoding scheme in Table 1 above, instruction 42 includes four subinstructions
when the set 37 has an encoding condition of 00. Each subinstruction is routed to a
separate FPU. Instruction 42 includes three subinstructions when the set 37 has an
encoding condition of 01 or 10. When encoded to 01, the multiplexer 44 selects the
first subinstruction. Thus, the first functional units of clusters 1 and 2 share the first
subinstruction. The second subinstruction goes to the second FPU of the first cluster.
The third subinstruction is shifted over to enter multiplexer 46, which selects such third
subinstruction for processing by the second FPU of the second cluster.

When the set 37 is encoded to 10, the first subinstruction goes to the first FPU
of the first cluster and the second subinstruction goes to the second FPU of the first
cluster. The multiplexer 44 selects the third subinstruction, so the third subinstruction
goes to the first FPU in the second cluster. The multiplexer 46 selects the second
subinstruction, so the second subinstruction is shared by the second FPU of the first
cluster and the second FPU of the second cluster.

Instruction 42 includes two subinstructions when the set 37 has an encoding
condition of 11. In such case the multiplexer 44 passes the first subinstruction, so the first
subinstruction is shared by the first FPU of the first cluster and the first FPU of the
second cluster. Similarly, multiplexer 46 passes the second subinstruction, so the
second subinstruction is shared by the second FPU of the first cluster and the second
FPU of the second cluster.
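The routing just described for two clusters of two functional units can be summarized in software form. This is only a behavioral sketch of the selection made by multiplexers 44 and 46, assuming no NOP compression; the function name expand and the list-of-lists return layout are hypothetical.

```python
def expand(control_bits, fields):
    """Route a compressed 2 x 2 instruction to its four FPU's.

    control_bits follows Table 1; fields holds two, three or four
    subinstructions. Returns [[fpu(1,1), fpu(1,2)], [fpu(2,1), fpu(2,2)]].
    """
    share_first = bool(control_bits & 0b01)
    share_second = bool(control_bits & 0b10)
    if len(fields) != 4 - share_first - share_second:
        raise ValueError("field count does not match the control bits")
    remaining = iter(fields[2:])     # fields beyond the first two, in order
    fpu_1_1, fpu_1_2 = fields[0], fields[1]
    # Multiplexer 44: first field if shared, else the next remaining field.
    fpu_2_1 = fields[0] if share_first else next(remaining)
    # Multiplexer 46: second field if shared, else the next remaining field.
    fpu_2_2 = fields[1] if share_second else next(remaining)
    return [[fpu_1_1, fpu_1_2], [fpu_2_1, fpu_2_2]]


print(expand(0b01, ["A", "B", "C"]))   # [['A', 'B'], ['A', 'C']]
print(expand(0b10, ["A", "B", "C"]))   # [['A', 'B'], ['C', 'B']]
```

With control bits 01 the third stored subinstruction is shifted to the second FPU of the second cluster, while with control bits 10 it goes to the first FPU of the second cluster, matching the two three-field cases above.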

Referring to Figs. 10A-E, subinstruction sharing is compared for various
instructions 42A to 42E on a processor having n=2 clusters and m=2 FPU's per
cluster. Each instruction includes up to four subinstructions 36. The four
subinstructions are arranged in two rows to visually correlate the subinstruction to its
destination FPU. Specifically, the subinstructions in the top row are destined for the
first and second FPU's (1,1), (1,2), respectively, of the first cluster, while the
subinstructions in the bottom row are destined for the first and second FPU's (2,1),
(2,2), respectively, of the second cluster. In addition, the instruction bit sizes are
shown for a subinstruction size equal to 32 bits. Shown for each instruction 42 are the
intended operation 48 (on left), the instruction 50 (center) with NOP compression only
and the instruction 42 compressed for subinstruction sharing (on right).

Table 4 below summarizes the number of instruction bits used to specify
different instruction cases with and without the subinstruction sharing. Where N is the
number of non-empty subinstruction fields in an instruction, the original compressed
instruction will be 32 x N bits long after instruction compression. With the
subinstruction sharing, however, there are different lengths depending on the
redundancy level in an instruction. If, for example, the control bits 37 are 00, (i.e., no
subinstruction sharing), then there are 32 x N + 2 bits for the instruction, including 2
bits of overhead compared to the original instruction. However, when the control bits
37 are 01 or 10, one subinstruction field is omitted by subinstruction sharing. The
result is 32 x (N - 1) + 2 bits, which saves 30 bits for the instruction. For the case
where the control bits are 11, two subinstruction fields are omitted and the number of
bits is 32 x (N - 2) + 2, which saves 62 bits for the instruction.






Table 4: Number of bits per instruction (N: number of non-empty subinstruction fields)

              Number of bits per instruction
S bit field   Original   With subinstruction sharing   Difference
00            32 x N     32 x N + 2                    -2
01 or 10      32 x N     32 x (N - 1) + 2              30
11            32 x N     32 x (N - 2) + 2              62
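The entries of Table 4 follow from a simple length calculation, sketched below for 32-bit subinstructions and 2 control bits. The constant and function names are hypothetical.

```python
SUBINSTRUCTION_BITS = 32
CONTROL_BITS = 2


def original_bits(n_fields):
    """Length with NOP compression only: 32 x N."""
    return SUBINSTRUCTION_BITS * n_fields


def shared_bits(n_fields, control):
    """Length with subinstruction sharing: each set control bit omits one
    field, and the 2 control bits are always added."""
    omitted = bin(control).count("1")
    return SUBINSTRUCTION_BITS * (n_fields - omitted) + CONTROL_BITS


def bits_saved(n_fields, control):
    return original_bits(n_fields) - shared_bits(n_fields, control)


for control in (0b00, 0b01, 0b10, 0b11):
    print(format(control, "02b"), bits_saved(4, control))
```

The difference does not depend on N: it is -2 bits for 00, 30 bits for 01 or 10, and 62 bits for 11.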



The actual effect of subinstruction sharing in an image computing program was
studied, in which several tight loop routines (2D convolution, 2D complex FFT, and
affine warping) were written in assembly language for the MAP 1000 processor. The
MAP 1000 has two clusters of two FPU's per cluster. Assuming that each
subinstruction is 32 bits wide, the number of instructions in the tight loop and their
redundancy characteristics are listed below in Table 5. For a 2D convolution, the
number of instruction bits that can be saved by subinstruction sharing is calculated as
-2 x 48 + 30 x 40 + 62 x 45 = 3894 bits. The total number of non-empty subinstructions
in the convolution tight loop was 337 in 133 instructions. Thus the original program
size is 337 x 32 = 10784 bits. Therefore, the subinstruction sharing results in a 36.1%
reduction in the tight loop program size as shown in Table 6. Similarly, the 2D
complex FFT and affine warping tight loops show reductions of 23.9% and 41.9% in
the program size, respectively.



Table 5: Analysis Of Several Image Computing Function Tight Loops

                  Number of instructions
Control Bits      2D convolution      2D complex FFT      Affine warping
00                48                  220                 60
01 or 10          40                  207                 41
11                45                  166                 65

Table 6: Results Of Subinstruction Sharing In Image Computing Functions

Functions           Original program size          Total number of      Program size
                    (total number of non-empty     bits reduced         reduction %
                    subinstructions)
2D convolution      337                            3894                 36.1%
2D complex FFT      2097                           16062                23.9%
Affine warping      383                            5140                 41.9%
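
The totals in Tables 5 and 6 can be cross-checked with a few lines of arithmetic. The Python sketch below is illustrative only. The dictionary and function names are hypothetical, while the numbers are copied from Tables 4, 5 and 6 and assume 32-bit subinstructions.

```python
SUBINSTRUCTION_BITS = 32

# Bits saved per instruction for each control-bit case (Table 4).
SAVED_PER_INSTRUCTION = {"00": -2, "01 or 10": 30, "11": 62}

# Instruction counts per control-bit case (Table 5) and the number of
# non-empty subinstructions in each tight loop (Table 6).
TIGHT_LOOPS = {
    "2D convolution": ({"00": 48, "01 or 10": 40, "11": 45}, 337),
    "2D complex FFT": ({"00": 220, "01 or 10": 207, "11": 166}, 2097),
    "Affine warping": ({"00": 60, "01 or 10": 41, "11": 65}, 383),
}


def total_bits_saved(counts):
    """Sum the per-instruction savings over all instructions of a tight loop."""
    return sum(SAVED_PER_INSTRUCTION[case] * n for case, n in counts.items())


def reduction_percent(counts, non_empty_subinstructions):
    """Bits saved as a percentage of the original tight loop size."""
    original_bits = SUBINSTRUCTION_BITS * non_empty_subinstructions
    return 100.0 * total_bits_saved(counts) / original_bits


if __name__ == "__main__":
    for name, (counts, non_empty) in TIGHT_LOOPS.items():
        print(name, total_bits_saved(counts),
              f"{reduction_percent(counts, non_empty):.1f}%")
```

Rounded to one decimal place, the computed reductions agree with the percentages listed in Table 6.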






Note that the program size reduction discussed above is for the tight loop only. 
When the caller function is considered together, the effect of subinstruction sharing 
would be lower. However, the effect is still very significant. Consider, for example, 
an application program that reads a 512 x 512 8-bit image, calls the 2D convolution 
tight loop, and writes the output image back to the memory, and that occupies about 
100 kbytes. The total program size reduction achieved by the subinstruction sharing is 
less than 0.5%. However, since most of the program outside the tight loop is executed 
only once while the tight loop is iterated many times, most of the program execution 
time is in fact spent within the tight loop. In the case of a 2D convolution with a 
15 x 15 kernel on the MAP 1000, the tight loop execution time occupies more than 89% 
of the total execution time. Therefore, fitting the tight loop in the available 
instruction cache is more important than reducing the whole program size. Moreover, 
when a more sophisticated tight loop (thus requiring more bits for its instructions) is 
developed and/or multiple tight loops are combined to form a new higher-level tight 
loop, it is desirable that the size of individual tight loops be as small as possible so 
that the new tight loop does not cause instruction cache thrashing, i.e., excessive 
instruction cache misses while iterating the tight loop. 
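
The whole-program figure quoted above can be checked in the same way. This minimal Python sketch is illustrative only; the function name is hypothetical, and "about 100 kbytes" is taken as 100,000 bytes, which is an assumption.

```python
def whole_program_reduction_percent(saved_bits: int, program_bytes: int) -> float:
    """Bits saved in the tight loop as a percentage of the whole program size."""
    return 100.0 * saved_bits / (8 * program_bytes)


# 3894 bits saved in the 2D convolution tight loop (Table 6);
# a program size of 100,000 bytes is assumed for "about 100 kbytes".
convolution_percent = whole_program_reduction_percent(3894, 100_000)
```

Under this assumption `convolution_percent` is below 0.5, consistent with the statement that the total program size reduction is less than 0.5%.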

Method for Identifying and Sharing Redundant Subinstructions 

Referring to Fig. 11, a flow chart 60 for identifying subinstruction sharing 
opportunities includes a step 62 in which the subinstructions of a given instruction are 
compared to determine whether a subinstruction sharing condition is present. In one 
embodiment, any subinstruction which appears for one or more corresponding 
functional processing units (_,i) of multiple clusters is to be shared. In other 
embodiments a more limited set of conditions is specified according to a specific 
design. Table 3 above, for example, lists an example of a limited set of conditions. At 
step 64 the set 37 of instruction compression control bits is set to identify each 
subinstruction sharing condition. Thereafter the instruction is stored in memory at step 
68. In some embodiments the instruction is stored in an uncompressed format (or in a 
format using only conventional compression techniques, such as NOP compression, 
without the subinstruction sharing compression). In another embodiment, the 
instruction is compressed at step 66 to omit redundant subinstructions where a 
subinstruction is to be shared. 
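
Steps 62 and 64 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: it assumes two clusters of two FPU's each, models an instruction as a hypothetical mapping from an FPU label (cluster, unit) to a 32-bit subinstruction word, treats an absent key as an empty (NOP) field, and assigns the first control bit to the unit-1 pair and the second to the unit-2 pair. The actual encoding of the set 37 is given by Table 3 and may differ.

```python
from typing import Dict, Tuple

Fpu = Tuple[int, int]  # hypothetical (cluster, unit) label, both 1-based


def sharing_control_bits(fields: Dict[Fpu, int]) -> Tuple[int, int]:
    """Compare corresponding FPU's of the two clusters (step 62) and
    return the control bits that flag each sharing condition (step 64).

    The bit order (unit-1 pair first, unit-2 pair second) is an assumption.
    """
    def same(unit: int) -> int:
        first = fields.get((1, unit))
        second = fields.get((2, unit))
        # Two empty fields are not treated as a sharing condition.
        return int(first is not None and first == second)

    return same(1), same(2)
```

For example, an instruction whose FPU's (1,1) and (2,1) hold the same word while FPU's (1,2) and (2,2) differ yields the control bits (1, 0).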

For embodiments in which the instruction is stored in memory without 
removing redundancies for subinstruction sharing, a method is performed at another 
time to compress or further compress the instruction. Referring to Fig. 12, at step 70 
of flow chart 69, the set 37 of instruction-compression control bits is tested to identify 
subinstruction sharing conditions. According to the encoded condition of the set 37, 
one or more subinstructions are deleted at step 72 from the instruction format. The 
deleted subinstructions are redundant subinstructions. An identical subinstruction 
remains which is to be shared by FPU's. The result is a compressed instruction or a 
further compressed instruction. Such resulting compressed instruction is routed to an 
instruction cache 22, a primary cache or into main memory 24. By reducing the size of 
the instruction, the space required in the instruction cache, along with the time 
required to move the data into the cache, is reduced. The method of flow chart 69 is 
performed in various embodiments when an instruction is moved from main memory 24 
into the instruction cache 22 (see Fig. 2) or at another time. 
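
The deletion of steps 70 and 72 can be sketched as follows. The Python below is a hedged illustration for a machine with two clusters of two FPU's. The field storage order, the rule that the cluster-2 copy is the one deleted, and the meaning of each control bit are assumptions made for the example, and the function name is hypothetical.

```python
from typing import Dict, List, Tuple

Fpu = Tuple[int, int]  # hypothetical (cluster, unit) label, both 1-based
FIELD_ORDER: List[Fpu] = [(1, 1), (1, 2), (2, 1), (2, 2)]  # assumed storage order


def delete_redundant(fields: Dict[Fpu, int], control: Tuple[int, int]) -> List[int]:
    """Return the stored words after deleting redundant cluster-2 copies.

    control[0] set: FPU (2,1) shares the word stored for FPU (1,1).
    control[1] set: FPU (2,2) shares the word stored for FPU (1,2).
    """
    deleted = {(2, unit) for unit, bit in zip((1, 2), control) if bit}
    for fpu in deleted:
        # Guard against control bits that do not match the instruction.
        if fpu not in fields or fields[fpu] != fields.get((1, fpu[1])):
            raise ValueError(f"control bits flag sharing for {fpu}, but fields differ")
    return [fields[fpu] for fpu in FIELD_ORDER if fpu in fields and fpu not in deleted]
```

With both control bits set, a four-field instruction shrinks to the two words held for the first cluster; with neither set, all non-empty fields are kept.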

Referring to Fig. 13, a flow chart 74 of a method for decoding the set 37 of 
instruction-compression control bits includes a step 76 of testing the control bits for 
various subinstruction sharing conditions. At step 78 the compressed instruction 42 is 
parsed to route the subinstructions to destination FPU's 28. When a subinstruction 
sharing condition is present, a subinstruction is routed to a plurality of corresponding 
functional processing units. 
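
The decoding of steps 76 and 78 can be sketched in the same style. The Python below is illustrative only and rests on simplifying assumptions: two clusters of two FPU's, no empty (NOP) fields in the uncompressed instruction, words stored in the order (1,1), (1,2), (2,1), (2,2) with shared cluster-2 copies omitted, and a hypothetical bit assignment in which the first control bit covers the unit-1 pair and the second the unit-2 pair.

```python
from typing import Dict, List, Tuple

Fpu = Tuple[int, int]  # hypothetical (cluster, unit) label, both 1-based


def route(words: List[int], control: Tuple[int, int]) -> Dict[Fpu, int]:
    """Test the control bits (step 76) and route each stored word to its
    destination FPU or FPU's (step 78)."""
    expected = 4 - sum(1 for bit in control if bit)
    if len(words) != expected:
        raise ValueError(f"expected {expected} stored words, got {len(words)}")
    remaining = iter(words)
    # Cluster-1 fields are always stored under the stated assumptions.
    routed: Dict[Fpu, int] = {(1, 1): next(remaining), (1, 2): next(remaining)}
    for unit, bit in zip((1, 2), control):
        # A set bit routes the cluster-1 word to the corresponding cluster-2 FPU.
        routed[(2, unit)] = routed[(1, unit)] if bit else next(remaining)
    return routed
```

For example, `route([0xA, 0xB], (1, 1))` delivers 0xA to FPU's (1,1) and (2,1) and 0xB to FPU's (1,2) and (2,2).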

Alternative Embodiments. Meritorious and Advantageous Effects 

Although subinstruction sharing cases have been described for redundant 
subinstructions among corresponding FPU's, there are additional situations of 
redundant subinstructions which also may be covered in some embodiments. For a 
generic architecture in which an instruction includes 'p' subinstruction fields, there is 
one situation of no redundant subinstructions, and not more than 2^p - 1 situations in 
which there is some degree of redundancy among subinstructions. For an architecture 
in which there are p = 8 subinstruction fields, there are not more than 2^8 = 256 
situations. Some situations turn out to be the same, so the number of situations is 
slightly less than 256. To cover every such situation, however, there are 'p' control 
bits in the set 37 of instruction-compression control bits. Thus, in one embodiment 
there are 'p' control bits included with each instruction. 

However, as the instruction width increases, it may be undesirable to add so 
many extra control bits for subinstruction sharing. In particular, the cost of so many 
bits may seem excessive when there tends to be a pattern among the subinstruction 
redundancies that comes up over and over again in practice. As a result, in the preferred 
embodiments the number of control bits is reduced to fewer than p to handle a prescribed 
number of the possible 2^p subinstruction sharing situations. Different processors are 
designed for different applications where the pattern of subinstruction redundancies 
varies. Further, the cases of subinstruction redundancies covered for subinstruction 
sharing are strategically selected for a given processor to have greatest impact on those 
applications for which the processor is targeted (e.g., for image processing 
applications). The preferred embodiments described in the prior sections relate to 
subinstruction sharing situations that have been found to occur in strategically 
important tight loops of common image processing functions. 

An advantage of the invention is that the required instruction space in an 
instruction cache is effectively reduced for VLIW instructions. In particular, for some 
functions executed during image processing algorithms that occupy tight loops, it is 
possible to maintain the tight loop without thrashing, where otherwise thrashing would 
occur. 

Another advantage is that by avoiding some redundancies in VLIW 
subinstructions, fewer instruction bits are needed, and correspondingly program size is 
reduced. In addition, efficiency of the instruction cache usage is improved, and 
instruction fetch bandwidth is increased. 

Although a preferred embodiment of the invention has been illustrated and 
described, various alternatives, modifications and equivalents may be used. Therefore, 
the foregoing description should not be taken as limiting the scope of the inventions 
which are defined by the appended claims. 

WHAT IS CLAIMED IS: 

1. A method for sharing a subinstruction of a given instruction among 
functional processing units of a plurality of clusters on a processor having a very long 
instruction word architecture, the given instruction including a set of control bits and at 
least one subinstruction, the processor comprising the plurality of clusters, each one 
cluster of the plurality of clusters comprising a plurality of functional processing units, 
the method comprising the steps of: 

testing the set of control bits to identify a prescribed condition; 

when the prescribed condition is identified, routing said subinstruction of the 
given instruction to multiple functional processing units as determined by the prescribed 
condition; 

concurrently executing the subinstruction at said multiple functional processing 
units. 

2. The method of claim 1, in which the step of routing comprises routing 
said subinstruction of the given instruction to a first functional processing unit of a first 
cluster of the plurality of clusters and to a first functional processing unit of a second 
cluster of the plurality of clusters; and in which the step of executing comprises 
concurrently executing the subinstruction at said first functional processing unit of the 
first cluster of the plurality of clusters and to the first functional processing unit of the 
second cluster of the plurality of clusters. 

3. The method of claim 2, in which the given instruction comprises a first 
subinstruction and a second subinstruction, the step of testing comprising testing the set 
of control bits to identify a first prescribed condition, the step of routing comprising 
routing the first subinstruction, the method further comprising the steps of: 

testing the set of control bits to identify a second prescribed condition; 

when the second prescribed condition is identified, routing said second 
subinstruction of the given instruction to a second functional processing unit of the first 
cluster of the plurality of clusters and to a second functional processing unit of the 
second cluster of the plurality of clusters; and 

concurrently executing the subinstruction at the first functional processing unit 
and the second functional processing unit; and 

wherein the step of executing comprises concurrently executing the first 
subinstruction at the first functional processing unit of the first cluster, the first 
subinstruction at the first functional processing unit of the second cluster, the second 
subinstruction at the second functional processing unit of the first cluster and the second 
subinstruction at the second functional processing unit of the second cluster. 

4. A method for storing an instruction of a computer program to be 
executed on a processor having a very long instruction word architecture, 

wherein each instruction comprises at least one subinstruction and up to a first 
prescribed number of subinstructions, the first prescribed number being at least two, 

wherein the processor is organized into a plurality of clusters equaling a second 
prescribed number, each one cluster of the plurality of clusters comprising a common 
number of functional processing units, wherein the common number of functional 
processing units times the second prescribed number equals the first prescribed 
number, 

wherein for a given instruction having the first prescribed number of 
subinstructions, each functional processing unit of the plurality of clusters is for 
executing a respective subinstruction of the given instruction, the method comprising 
the steps of: 

identifying a pattern in which a subinstruction occurs more than once in the 
given instruction, said subinstruction being a redundant subinstruction; 

determining whether the pattern is among a set of prescribed patterns; 

when the pattern is among the set of prescribed patterns, setting a set of control 
bits for the instruction to indicate that said pattern is present. 

5. The method of claim 3, further comprising compressing the given 
instruction when the pattern is among the set of prescribed patterns by deleting one 
occurrence of the redundant subinstruction in the given instruction to achieve a 
compressed instruction. 

6. The method of claim 5, further comprising the steps of: 
moving the compressed instruction into instruction cache; 

testing the set of control bits of the compressed instruction to determine a 
condition is identified in which subinstruction sharing is to occur for the compressed 
instruction; 

when subinstruction sharing is determined to occur, parsing the compressed 
instruction to route the redundant subinstruction to a plurality of functional processing 
units as determined by the identified condition; 

concurrently executing the subinstruction at said plurality of functional 
processing units. 

7. A method for storing an instruction of a computer program to be 
executed on a processor having a very long instruction word architecture, 

wherein each instruction comprises at least one subinstruction and up to a first 
prescribed number of subinstructions, the first prescribed number being at least four, 

wherein the processor is organized into a plurality of clusters equaling a second 
prescribed number, each one cluster of the plurality of clusters comprising a common 
number of functional processing units, wherein the common number of functional 
processing units times the second prescribed number equals the first prescribed 
number, 

wherein for a given instruction having the first prescribed number of 
subinstructions, each functional processing unit of the plurality of clusters is for 
executing a respective subinstruction of the given instruction, the method comprising 
the steps of: 

for the given instruction, comparing a first subinstruction which is to be 
processed by a first functional unit of a first cluster of the plurality of clusters with a 
second subinstruction which is to be processed by a first functional unit of a second 
cluster of the plurality of clusters; 

for a case in which the first subinstruction is the same as the second 
subinstruction setting a first control bit of a set of control bits associated with the given 
instruction to a first logic state which indicates that the second subinstruction equals the 
first subinstruction; 

for the given instruction, comparing a third subinstruction which is to be 
processed by a second functional unit of the first cluster of the plurality of clusters with 
a fourth subinstruction which is to be processed by a second functional unit of the 
second cluster of the plurality of clusters; and 

for a case in which the third subinstruction is the same as the fourth 
subinstruction setting a second control bit of the set of control bits associated with the 
given instruction to a second logic state which indicates that the fourth subinstruction 
equals the third subinstruction; and 

storing the given instruction with the first control bit and the second control bit. 

8. The method of claim 7, in which the step of storing comprises storing 
the given instruction in an uncompressed format, and further comprising the steps of 
compressing the given instruction into a compressed format, and storing the given 
instruction in cache in the compressed format, the step of compressing comprising the 
steps of: 

testing the first control bit associated with the given instruction; 

for a case in which the first control bit equals the first logic state compressing 
the given instruction to reduced size in which one copy of the equal first subinstruction 
and second subinstruction is omitted to avoid redundant storage of the first 
subinstruction and the second subinstruction; 

testing the second control bit associated with the given instruction; and 

for a case in which the second control bit equals the second logic state 
compressing the given instruction to reduced size in which one copy of the equal third 
subinstruction and fourth subinstruction is omitted to avoid redundant storage of the 
third subinstruction and the fourth subinstruction. 

9. The method of claim 7, in which the step of storing comprises storing 
the given instruction in a compressed format, and further comprising prior to the step of 
storing, the step of compressing the given instruction into the compressed format, the 
step of compressing, comprising the steps of: 

when the first control bit equals the first logic state compressing the given 
instruction to reduced size in which one copy of the equal first subinstruction and 
second subinstruction is omitted to avoid redundant storage of the first subinstruction 
and the second subinstruction; 

when the second control bit equals the second logic state compressing the 
given instruction to reduced size in which one copy of the equal third subinstruction and 
fourth subinstruction is omitted to avoid redundant storage of the third subinstruction 
and the fourth subinstruction. 

10. The method of claim 9, further comprising the step of storing the given 
instruction in cache in the compressed format. 

11. The method of claim 7, further comprising the steps of: 
storing the given instruction in cache in a compressed format with the first 
control bit and the second control bit, the compressed format combining the storage of 
the first subinstruction with the storage of the second subinstruction into a first 
combined storage when the first control bit is set to the first logic state, the compressed 
format combining the storage of the third subinstruction with the storage of the fourth 
subinstruction into a second combined storage when the second control bit is set to the 
second logic state; 

testing the first control bit; 

when the first control bit is set to the first logic state, routing a content of the 
first combined storage to the first functional processing unit of the first cluster and the 
first functional processing unit of the second cluster for concurrent execution by the 
first functional processing unit of the first cluster and the first functional processing unit 
of the second cluster; 

testing the second control bit; and 

when the second control bit is set to the second logic state, routing a content of 
the second combined storage to the second functional processing unit of the first cluster 
and the second functional processing unit of the second cluster for concurrent execution 
by the second functional processing unit of the first cluster and the second functional 
processing unit of the second cluster. 

12. A method for compressing into a compressed format, an instruction of 
a computer program to be executed on a processor having a very long instruction word 
architecture, 

wherein each instruction comprises at least one subinstruction and up to a first 
prescribed number of subinstructions, the first prescribed number being at least four, 

wherein the processor is organized into a plurality of clusters equaling a second 
prescribed number, each one cluster of the plurality of clusters comprising a common 
number of functional processing units, wherein the common number of functional 
processing units times the second prescribed number equals the first prescribed 
number, 

wherein for a given instruction having the first prescribed number of 
subinstructions, each functional processing unit of the plurality of clusters is for 
executing a respective subinstruction of the given instruction, the method comprising 
the steps of: 






for the given instruction, comparing a first subinstruction which is to be 
processed by a first functional unit of a first cluster of the plurality of clusters with a 
second subinstruction which is to be processed by a first functional unit of a second 
cluster of the plurality of clusters; 

for a case in which the first subinstruction is the same as the second 
subinstruction compressing the given instruction to be stored with the first 
subinstruction and without the second subinstruction, and setting a first control bit 
associated with the given instruction to a logic state which indicates that the second 
subinstruction equals the first subinstruction; 

for the given instruction, comparing a third subinstruction which is to be 
processed by a second functional unit of the first cluster of the plurality of clusters with 
a fourth subinstruction which is to be processed by a second functional unit of the 
second cluster of the plurality of clusters; and 

for a case in which the third subinstruction is the same as the fourth 
subinstruction compressing the given instruction to be stored with the third 
subinstruction and without the fourth subinstruction, and setting a second control bit 
associated with the given instruction to a logic state which indicates that the fourth 
subinstruction equals the third subinstruction. 

13. A computer system comprising: 

a processor having a very large word instruction architecture and 
including a plurality of clusters of functional processing units, each one cluster of the 
plurality of clusters comprising a common number of functional processing units, the 
processor comprising a first prescribed number of clusters, said very large word 
instruction architecture allowing an instruction to have up to a second prescribed 
number of subinstructions, where the second prescribed number equals the first 
prescribed number times the common number, each instruction to be executed by the 
processor comprising from one subinstruction up to the second prescribed number of 
subinstructions, along with a set of control bits; and 

an instruction cache memory which stores a first instruction in a 
compressed format determined by a condition of the set of control bits, the compressed 
format including a shared subinstruction stored in a given field of the first instruction 
which is to be shared by a plurality of the functional processing units, said plurality of 
functional processing units being determined by said condition of the set of control bits. 

14. The system of claim 13, in which said shared subinstruction is for 
a first functional processing unit of a first cluster and a first functional processing unit 
of a second cluster when the set of control bits identifies a first prescribed condition. 

15. The system of claim 14, in which the shared subinstruction is a first 
shared subinstruction, and in which the compressed format further includes a second 
shared subinstruction for a second functional processing unit of the first cluster and a 
second functional processing unit of the second cluster when the set of control bits 
either concurrently identifies a second prescribed condition. 

16. The system of claim 14, further comprising: 

means for testing the set of control bits for a given instruction; and 
means for routing said first common subinstruction to the first functional 
processing unit of the first cluster and to the first functional processing unit of the 
second cluster of the plurality of clusters when said testing means identifies the first 
prescribed condition. 

17. The system of claim 16, in which the first common subinstruction is 
concurrently executed at the first functional processing unit of the first cluster and the 
first functional processing unit of the second cluster. 

18. The system of claim 14, in which the first instruction in an 
uncompressed format includes the second prescribed number of subinstructions, the 
first instruction comprising a first subinstruction for being executed by a first functional 
processing unit of a first cluster and a second subinstruction for being executed by a 
first functional processing unit of a second cluster, the system further comprising 
means for compiling the first instruction, the compiling means comprising: 

means for comparing the first subinstruction and the second subinstruction; 
means for setting a state of the set of control bits to identify a first prescribed 
condition when the first subinstruction is equal to the second subinstruction. 

19. The system of claim 14, in which a first instruction in uncompressed 
format includes the second prescribed number of subinstructions, the first instruction 
comprising a first subinstruction for being executed by a first functional processing unit 
of a first cluster and a second subinstruction for being executed by a first functional 
processing unit of a second cluster, the system further comprising means for 
compressing the first instruction into the compressed format, the compressing means 
comprising: 

means for testing the set of control bits associated with the first instruction; 
means for reducing the size of the first instruction by omitting the second 
subinstruction when the set of control bits identifies that the first subinstruction equals 
the second subinstruction. 



20. The system of claim 14, in which a first instruction in uncompressed 
format includes the second prescribed number of subinstructions, the first instruction 
comprising a first subinstruction for being executed by a first functional processing unit 
of a first cluster and a second subinstruction for being executed by a first functional 
processing unit of a second cluster, the system further comprising means for caching 
the first instruction, the caching means comprising: 

means for testing the set of control bits associated with the first instruction; 

means for reducing the size of the first instruction to achieve a compressed 
format by omitting the second subinstruction when the set of control bits identifies that 
the first subinstruction equals the second subinstruction; and 

means for loading the first instruction into the instruction cache in the 
compressed format. 

21. The system of claim 14, in which a first instruction in uncompressed 
format includes the second prescribed number of subinstructions, the first instruction 
comprising a first subinstruction for being executed by a first functional processing unit 
of a first cluster and a second subinstruction for being executed by a first functional 
processing unit of a second cluster, the system further comprising means for caching 
the first instruction, the caching means comprising: 

means for comparing the first subinstruction and the second subinstruction; 
means for setting a state of the set of control bits associated with the first 
instruction to identify a first prescribed condition when the first subinstruction is equal 
to the second subinstruction; 

means for reducing the size of the first instruction to achieve a compressed 
format by omitting the second subinstruction when the set of control bits identifies that 
the first subinstruction equals the second subinstruction; and 

means for loading the first instruction into the instruction cache in the 
compressed format. 

METHOD AND APPARATUS FOR COMPRESSING VLIW 
INSTRUCTION AND SHARING SUBINSTRUCTIONS 

ABSTRACT OF THE DISCLOSURE 

A VLIW instruction format is introduced having a set of control bits which 
identify subinstruction sharing conditions. At compilation the VLIW instruction is 
analyzed to identify subinstruction sharing opportunities. Such opportunities are 
encoded in the control bits of the instruction. Before the instruction is moved into the 
instruction cache, the instruction is compressed into the new format to delete select 
redundant occurrences of a subinstruction. Specifically, where a subinstruction is to be 
shared by corresponding functional processing units of respective clusters, the 
subinstruction need only appear in the instruction once. The redundant appearance is 
deleted. The control bits are decoded at instruction parsing time to route a shared 
subinstruction to the associated functional processing units. 

Inventor 1

Inventor Full Name
    Last Name: Kim
    First Name: Donglok
    Middle Name or Initial:
Residence & Citizenship
    City: Seattle
    State or Foreign Country: Washington
    Country of Citizenship: Republic of Korea
Post Office Address
    Post Office Address: 5290 Mithun Place, N.E.
    City: Seattle
    State or Country: Wa.
    Zip code: 98105

Inventor 2

Inventor Full Name
    Last Name: Berg
    First Name: Stefan
    Middle Name or Initial: G.
Residence & Citizenship
    City: Seattle
    State or Foreign Country: Washington
    Country of Citizenship: U.S.
Post Office Address
    Post Office Address: 5212 University Way NE Apt 203
    City: Seattle
    State or Country: Wa.
    Zip code: 98105

Inventor 3

Inventor Full Name
    Last Name: Sun
    First Name: Weiyun
    Middle Name or Initial:
Residence & Citizenship
    City: Seattle
    State or Foreign Country: Washington
    Country of Citizenship: People's Republic of China
Post Office Address
    Post Office Address: 818 NE 106th Street
    City: Seattle
    State or Country: Wa.
    Zip code: 98125

Inventor 4

Inventor Full Name
    Last Name: Kim
    First Name: Yongmin
    Middle Name or Initial:
Residence & Citizenship
    City:
    State or Foreign Country: Washington
    Country of Citizenship: Republic of Korea
Post Office Address
    Post Office Address: 4431 NE 189th Place
    City: Seattle
    State or Country: Wa.
    Zip code: 98155


18 USC 1001 DECLARATION: I further declare that all statements made herein of my own knowledge are 
true and that all statements made on information and belief are believed to be true; and further that these 
statements were made with knowledge that willful false statements and the like so made are punishable by fine or 
imprisonment, or both, under section 1001 of Title 18 of the United States Code, and that such willful false 
statements may jeopardize the validity of the application or any patent issuing thereon. 

SIGNATURES: 

Signature Inventor 1 (Donglok Kim)        Date
Signature Inventor 2 (S. Berg)            Date
Signature Inventor 3 (W. Sun)             Date
Signature Inventor 4 (Yongmin Kim)        Date
Signature Inventor 5                      Date
Signature Inventor 6                      Date
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KODA LAW OFFICE Atty. Docket No OT2.P59 

DECLARATION AND POWER OF ATTORNEY 
37 CFR 1.63 DECLARATION: As a below named inventor, I hereby declare that my residence, post office 
address and citizenship are as stated below next to my name. I believe I am the original, first and sole inventor (if 
only- one name is listed below) or an original, first and joint inventor (if plural inventors are named below) of the 
subject matter which is claimed and for which a patent is sought on the invention entitled: Method and 
Apparatus for Compressing VLIW Instruction and Sharing Subinstructions. the specification of which fX] is 
attached hereto, or [ ] was filed on as Application Serial No. and was amended on (if applicable). 

DUTY OF DISCLOSURE: I have reviewed and understand the contents of the above identified application, 
including the claims, as amended by any amendment referred to above. I acknowledge the duty to disclose 
information which is material to the examination of this application in accordance with 37 CFR 1.56(a). 
[_] In compliance with this duty, attached is an Information Disclosure Statement under 37 CFR 1.97. 



35 USC 119 PRIORITY: I claim priority benefits under Title 35, United States Code, Section 119(e) of any 
provisional application(s) for patent listed below: 



Country 


Application Number 


Date of Filing 


Priority Claimed Under 35 USC 119 



[ X ] Yes [ ] No 



35 USC 120 BENEFIT: I claim the benefit under Title 35, United States Code, Section 120 of any United States 
application(s) listed below and, insofar as the subject matter of each of the claims of this application is not 
disclosed in the prior United States application in the manner provided by the first paragraph of Title 35, United 
States Code, Section 112, I acknowledge the duty to disclose material information as defined in Title 37, Code of 
Federal Regulations, Section 1.56(a) which occurred between the filing date of the prior application and the 
national or PCT international filing date of this application: 



Application Serial No. 



Date of Filing 



Status 



[ ] Patented [ ] Pending [ ] Abandoned 



POWER OF ATTORNEY: As a named inventor, I hereby appoint the following attorney(s) and/or agent(s) to 
prosecute this application and transact all business in the Patent and Trademark Office connected therewith: 

Steven P. Koda (Reg. No. 32,252) 



Send Correspondence to: 

KODA LAW OFFICE 

P.O. Box 10057 

Bainbridge Island, WA 98110 



Direct Telephone Calls To: 

Steven P. Koda 
(206) 780-8336 
FAX (206) 780-8353 
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KODA LAW OFFICE Atty. Docket No. OT2.P59 

DECLARATION AND POWER OF ATTORNEY 

INVENTOR(S): 
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Residence & 
Citizenship 
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State or Foreign Country 

Washington 


Country of Citizenship 

Republic of Korea 




Post Office 
Address 


Post Office Address 

5290 Mithun Place, N.E. 


City 

Seattle 


State or Country 

Wa. 


Zip code 

98105 
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Middle Name or Initial 
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Residence & 
Citizenship 


City 
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State or Foreign Country 

Washington 


Country of Citizenship 

U.S. 
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Address 


Post Office Address 

5212 University Way NE Apt 203 


City 

Seattle 


State or Country 

Wa. 


Zip code 

98105 





Inventor Full 
Name 


Last Name 

Sun 


First Name 

Weiyun 


Middle Name or Initial 


3 


Residence & 
Citizenship 


City 
Seattle 


State or Foreign Country 

Washington 


Country of Citizenship 
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of China 
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City 
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State or Country 
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Zip code 
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Name 
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Kim 
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Residence & 
Citizenship 


City 
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State or Foreign Country 
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Country of Citizenship 

Republic of Korea 




Post Office 
Address 


Post Office Address 

4431 NE 189th Place 


City 

Seattle 


State or Country 

Wa. 


Zip code 

98155 



18 USC 1001 DECLARATION: I further declare that all statements made herein of my own knowledge are 
true and that all statements made on information and belief are believed to be true; and further that these 
statements were made with knowledge that willful false statements and the like so made are punishable by fine or 
imprisonment, or both, under section 1001 of Title 18 of the United States Code, and that such willful false 
statements may jeopardize the validity of the application or any patent issuing thereon. 

SIGNATURES: 



Signature Inventor 1 (Donglok Kim) 


Signature Inventor 2 (S. Berg) 


Signature Inventor 3 (W. Sun) 


Date 


Date 


Date 


Signature Inventor 4 (Yongmin Kim) 


Signature Inventor 5 


Signature Inventor 6 


Date 


Date 


Date 
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KODA LAW OFFICE Atty. Docket No. OT2.P59 

DECLARATION AND POWER OF ATTORNEY 
37 CFR 1.63 DECLARATION: As a below named inventor, I hereby declare that my residence, post office 
address and citizenship are as stated below next to my name. I believe I am the original, first and sole inventor (if 
only one name is listed below) or an original, first and joint inventor (if plural inventors are named below) of the 
subject matter which is claimed and for which a patent is sought on the invention entitled: Method and 
Apparatus for Compressing VLIW Instruction and Sharing Subinstructions, the specification of which [X] is 
attached hereto, or [ ] was filed on as Application Serial No. and was amended on (if applicable). 

DUTY OF DISCLOSURE: I have reviewed and understand the contents of the above identified application, 
including the claims, as amended by any amendment referred to above. I acknowledge the duty to disclose 
information which is material to the examination of this application in accordance with 37 CFR 1.56(a). 
[_] In compliance with this duty, attached is an Information Disclosure Statement under 37 CFR 1.97. 



35 USC 119 PRIORITY: I claim priority benefits under Title 35, United States Code, Section 119(e) of any 
provisional application(s) for patent listed below: 



Country 


Application Number 


Date of Filing 


Priority Claimed Under 35 USC 119 



[ X ] Yes [ ] No 



35 USC 120 BENEFIT: I claim the benefit under Title 35, United States Code, Section 120 of any United States 
application(s) listed below and, insofar as the subject matter of each of the claims of this application is not 
disclosed in the prior United States application in the manner provided by the first paragraph of Title 35, United 
States Code, Section 112, I acknowledge the duty to disclose material information as defined in Title 37, Code of 
Federal Regulations, Section 1.56(a) which occurred between the filing date of the prior application and the 
national or PCT international filing date of this application: 



Application Serial No. 



Date of Filing 



Status 



[ ] Patented [ ] Pending [ ] Abandoned 



POWER OF ATTORNEY: As a named inventor, I hereby appoint the following attorney(s) and/or agent(s) to 
prosecute this application and transact all business in the Patent and Trademark Office connected therewith: 

Steven P. Koda (Reg. No. 32,252) 



Send Correspondence to: 

KODA LAW OFFICE 

P.O. Box 10057 

Bainbridge Island, WA 98110 



Direct Telephone Calls To: 

Steven P. Koda 
(206) 780-8336 
FAX (206) 780-8353 
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KODA LAW OFFICE Atty. Docket No. OT2.P59 

DECLARATION AND POWER OF ATTORNEY 

INVENTOR(S): 
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Inventor Full 
Name 


Last Name 
Kim 


First Name 

Donglok 


Middle Name or Initial 


Residence & 
Citizenship 


City 
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State or Foreign Country 
Washington 


Country of Citizenship 
Republic of Korea 


Post Office 
Address 


Post Office Address 

5290 Mithun Place, N.E. 


City 

Seattle 


State or Country 

Wa. 


Zip code 

98105 
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Inventor Full 
Name 


Last Name 

Berg 


First Name 

Stefan 


Middle Name or Initial 

G. 


Residence & 
Citizenship 


City 

Seattle 


State or Foreign Country 

Washington 


Country of Citizenship 

U.S. 


Post Office 
Address 


Post Office Address 

5212 University Way NE Apt 203 


City 

Seattle 


State or Country 

Wa. 


Zip code 

98105 
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Inventor Full 
Name 


Last Name 

Sun 


First Name 

Weiyun 


Middle Name or Initial 


Residence & 
Citizenship 


City 
Seattle 


State or Foreign Country 

Washington 


Country of Citizenship 

Peoples Republic 
of China 


Post Office 
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Post Office Address 

818 NE 106th Street 


City 

Seattle 


State or Country 
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Zip code 

98125 
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Inventor Full 
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Last Name 
Kim 


First Name 
Yongmin 


Middle Name or Initial 
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State or Foreign Country 
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Country of Citizenship 

Republic of Korea 
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Seattle 


State or Country 

Wa. 


Zip code 

98155 



18 USC 1001 DECLARATION: I further declare that all statements made herein of my own knowledge are 
true and that all statements made on information and belief are believed to be true; and further that these 
statements were made with knowledge that willful false statements and the like so made are punishable by fine or 
imprisonment, or both, under section 1001 of Title 18 of the United States Code, and that such willful false 
statements may jeopardize the validity of the application or any patent issuing thereon. 

SIGNATURES: 



Signature Inventor 1 (Donglok Kim) 


Signature Inventor 2 (S. Berg) 


Signature Inventor 3 (W. Sun) 


Date 


Date 


Date 


Signature Inventor 4 (Yongmin Kim) 


Signature Inventor 5 


Signature Inventor 6 


Date 


Date 


Date 
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KODA LAW OFFICE Atty. Docket No. OT2.P59 

DECLARATION AND POWER OF ATTORNEY 
37 CFR 1.63 DECLARATION: As a below named inventor, I hereby declare that my residence, post office 
address and citizenship are as stated below next to my name. I believe I am the original, first and sole inventor (if 
only one name is listed below) or an original, first and joint inventor (if plural inventors are named below) of the 
subject matter which is claimed and for which a patent is sought on the invention entitled: Method and 
Apparatus for Compressing VLIW Instruction and Sharing Subinstructions, the specification of which [X] is 
attached hereto, or [ ] was filed on _ as Application Serial No. _ and was amended on _ (if applicable). 

DUTY OF DISCLOSURE: I have reviewed and understand the contents of the above identified application, 
including the claims, as amended by any amendment referred to above. I acknowledge the duty to disclose 
information which is material to the examination of this application in accordance with 37 CFR 1.56(a). 
[_] In compliance with this duty, attached is an Information Disclosure Statement under 37 CFR 1.97. 

35 USC 119 PRIORITY: I claim priority benefits under Title 35, United States Code, Section 119(e) of any 
provisional application(s) for patent listed below: 



Country 



Application Number 



Date of Filing 


Priority Claimed Under 35 USC 119 



[ X ] Yes [ ] No 



35 USC 120 BENEFIT: I claim the benefit under Title 35, United States Code, Section 120 of any United States 
application(s) listed below and, insofar as the subject matter of each of the claims of this application is not 
disclosed in the prior United States application in the manner provided by the first paragraph of Title 35, United 
States Code, Section 112, I acknowledge the duty to disclose material information as defined in Title 37, Code of 
Federal Regulations, Section 1.56(a) which occurred between the filing date of the prior application and the 
national or PCT international filing date of this application: 



Application Serial No. 



Date of Filing 



Status 



[ ] Patented [ ] Pending [ ] Abandoned 



POWER OF ATTORNEY: As a named inventor, I hereby appoint the following attorney(s) and/or agent(s) to 
prosecute this application and transact all business in the Patent and Trademark Office connected therewith: 

Steven P. Koda (Reg. No. 32,252) 



Send Correspondence to: 

KODA LAW OFFICE 

P.O. Box 10057 

Bainbridge Island, WA 98110 



Direct Telephone Calls To: 

Steven P. Koda 
(206) 780-8336 
FAX (206) 780-8353 
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